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1.  Summary  of  work  performed 

0)  Overview:  during  the  period  of  this  contract  research  was 
ongoing  in  several  areas:  stable  array  processing  (with  A.  Steele); 
two-dimensional  reconstruction  algorithms  (with  M.  Fiddy);  extensions  of 
the  concept  of  cross-entropy  distance  measures  (with  L.  Jones);  and  the 
beginning  of  new  work  on  array  processing  in  more  complicated  acoustic 
environments  (with  NORDA  personnel).  Other  work,  by  Jones  and  by  J. 
Benedetto,  was  partially  supported  by  this  contract  (title  pages  enclosed). 

1)  The  array  processing  work  (with  A.  Steele)  In  collaboration 
with  Dr.  A.  Steele  i  have  developed  a  "sector-  focused  stability"  method 
for  stabilizing  nonlinear  methods  for  localization  and  resolution.  This 
work  was  performed  at  DRCS,  Salisbury,  South  Australia  (with  partial 
support  from  the  DOD  of  Australia)  and  at  the  University  of  Lowell,  under 
subcontract.  During  the  period  of  the  present  contract  we  considered  the 
destabilizing  effects  of  short  averaging  times;  a  paper  on  this  is  in 
preparation 

2)  Two-dimensional  reconstruction  (with  M.  Fiddy)  We 

considered  the  problem  of  reconstruction  of  two-dimensional  distribution 
functions  from  spectral  values  as  a  problem  of  obtaining  finite-parameter 
approximations  to  optimal  Wiener  filters.  The  first  paper  on  this  subject 
has  just  been  accepted  by  Inverse  Problems  (galleys  enclosed)  and  a 
second,  presenting  illustrations,  is  in  preparation. 

3)  Extension  of  cross-entropy  distances  (with  L,  Jones)  Work 
of  L.  Jones  on  extension  of  cross-entropy  distance  measures  has  been 
applied  to  obtain  new  algorithms  for  incorporating  a  positivity  constraint 
in  nonlinear  reconstruction.  A  paper  on  this  subject  is  in  preparation  ( a 
first  draft  is  enclosed). 

4)  Array  processing  in  a  complex  acoustic  environment  (with 
C.  Feuillade,  D.  DelBaizo  et  al,  NORDA)  Standard  linear  and  nonlinear 
array  processing  methods  cannot  be  applied  unaltered  for  range,  depth  and 
bearing  estimation  in  shallow  water  situations  (with,  for  example,  a 
normal  mode  model  for  the  propagation).  We  have  begun  to  develop  new 
methods  for  incorporating  propagation  models  in  linear  and  nonlinear  array 
processing.  C.  Byrne  spent  a  week  at  NORDA  in  June  87,  supported  partly  by 
NORDA;  a  paper  on  this  work  is  in  progress  (first  draft  enclosed). 


2.  Background 

A  previous  contract  with  ONR,  for  three  years  and  through  Catholic 
University,  was  terminated  at  the  end  of  two  years  rather  than  continue 
the  subcontract  arrangement  to  Univ.  of  Lowell.  The  present  contract 
supported  the  wrapping  up  phase  of  the  earlier  work  with  A.  Steele  and 
with  M.  Fiddy,  as  well  as  the  beginning  of  new  efforts  with  L.  Jones  and 
with  NORDA  personnel  (C.  Feuillade  and  D.  DelBalzo).  In  addition,  John 
Benedetto  was  supported  for  two  weeks  in  his  investigation  of  the  use  of 
wavelets  and  Gabor  transforms  to  represent  transient  signals. 


Work  supported  by  ONR  Contract  N00014-K-87-0394  (Period:  May 
87-Sept87) 


Papers  accepted  for  publication: 

1 .  "Images  as  power  specta;  reconstruction  as  Wiener  filter 
approximation,"  in  Inverse  Problems  (with  M.  Fiddy); 

Papers,  iO-Br.gaarati,Q.o,; 

1 .  "Stabilizing  eigenvector  methods  of  source  localization  and 
resolution  for  the  case  of  white  noise  and  short  averaging  time,r 
(with  A.  Steele); 

2.  "On  entropy  critieria  for  solving  inverse  problems  with  positivity 
constraints,"  (with  L.  Jones); 

3.  "Stable  data  adaptive  matched  field  methods  for  ambiguity  reduction 
in  source  parameter  estimation,"  (with  D.  DelBalzo  and  C.  Feuillade); 

4.  "Reconstruction  as  filter  function  approximation:  some  algorithms," 
(with  M.  Fiddy). 

Papers  bv  other  authors  receiving  partial  support  from  this  contract: 

1 .  "Approximation-theoretic  derivation  of  logarithmic  entropy 
principles  for  inverse  problems  and  unique  extension  of  the  maximum 
entropy  principle  to  incorporate  prior  knowledge,"  by  Lee  Jones 
(accepted  by  SIAM  J.  Applied  Math.); 

2.  "Gabor  representations  and  wavelets,"  by  John  Benedetto. 

Talks: 

1 .  "Sector-focused  stability  for  high  resolution  array  processing,"  IEEE 
Workshop  on  Underwater  Acoustic  Signal  Processing,  Univ.  of  Rhode 
Island,  Sept.  87  (with  A.  Steele). 
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Abstract.  The  problem  of  reconstructing  a  non-negative  function  from  finitely  many 
values  of  its  Fourier  transform  is  a  problem  of  approximating  one  function  by  another  and. 
as  such,  is  analogous  to  the  design  of  finite-impulse-response  approximations  to  the 
Wiener  filter.  Using  this  analogy  we  obtain  reconstruction  methods  that  are  computation¬ 
ally  simpler  approximations  of  entropy-based  procedures.  Our  linear  estimators  allow  for 
the  inclusion  of  prior  information  about  oversampling  rate.  i.e.  support  information,  as 
well  as  other  prior  knowledge  of  the  general  shape  of  the  object.  Our  nonlinear  methods, 
designed  to  recover  spiky  objects,  make  use  of  prior  information  about  non-uniformity  in 
the  background  to  avoid  bias  in  the  estimation  of  peak  locations. 


1.  Introduction 

The  problem  of  reconstructing  a  non-negative  function  f[a,  b)  of  two  real  variables 
from  finitely  many  values  of  its  Fourier  transform  (ft)  arises  in  a  number  of 
applications.  These  include  recovering  an  image  or  object  distribution  from  its 
spectrum,  a  power  spectrum  from  its  autocorrelation  function,  a  distribution  of  energy 
in  bearing  from  cross-sensor  correlations  or  a  bivariate  probability  density  from  its 
characteristic  function.  In  many  cases  of  interest  the  function /(o,  b)  is  non-negative 
and  we  shall  make  that  assumption  here.  The  problem  of  limited  data  can  arise  for  a 
variety  of  reasons:  to  remove  the  effects  of  a  known  convolution-filter  degradation 
one  can  divide  by  the  filter  transfer  function  in  the  spectral  domain,  but  must  avoid 
dividing  by  small  quantities;  in  the  case  of  sensor  array  processing  one  is  limited  to 
spatial  separations  provided  by  the  array  geometry. 

Because  the  data  are  finite  there  will  always  be  infinitely  many  reconstructions 
consistent  with  the  data  values.  Some  of  these  reconstructions  will  be  reasonably 
good,  while  others  will  not;  the  data  constraints,  by  themselves,  will  not  automatically 
lead  to  a  good  reconstruction  unless  the  number  of  data  values  is  large.  There  are  J 
several  methods  based  on  minimising  some  cost  function,  such  as  entropy;  one  | 
problem  with  such  approaches  is  that  it  is  not  always  clear  just  how  the  resulting  j 
reconstruction  is  related  to  the  correct  answer.  The  methods  we  present  here  are 
based  on  the  theory  of  best  approximation  in  Hilbert  space  and  make  clear  how  the  I 
reconstruction  is  related  to  the  original,  unknown,  correct  object  function.  ! 
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In  order  to  obtain  a  good  reconstruction  it  is  necessary  to  incorporate  additional 
information  about  the  function  being  reconstructed;  in  some  cases  support  infor¬ 
mation  is  used  or  positivity  is  enforced;  in  others  upper  and  lower  bounds  are 
employed.  In  a  number  of  methods  one  uses  a  prior  estimate  of  f(a.  b);  this  is  done  in 
cross-entropy  minimisation  and  it  has  been  shown  that  the  Burg  maximum  entropy 
method  employs  (tacitly)  a  uniform  prior  estimate  [1],  In  earlier  work  [I]  work  we 
extended  the  Burg  method  for  the  one-dimensional  case  to  incorporate  other  prior 
information  and  considered  numerical  examples;  our  purpose  here  is  to  provide  a 
theoretical  justification  for  that  procedure,  based  on  analogy  with  the  design  of 
approximate  Wiener  filters.  This  approach  allows  for  generalisation  to  higher  dimen¬ 
sions,  which  we  also  consider.  We  present  both  linear  and  nonlinear  methods. 

One  of  the  difficulties  with  methods  that  incorporate  prior  knowledge  is  that  it  is 
not  always  clear  what  the  prior  estimate  is  estimating.  As  our  development  here 
reveals,  the  role  of  the  prior  estimate  is  different  in  the  linear  and  the  nonlinear 
methods.  The  chief  virtue  of  the  Wiener  filter  design  approach  is  that  it  gives  us  a  clear 
picture  of  the  role  being  played  by  the  prior  estimate.  Loosely  speaking,  in  the  case  of 
linear  methods  the  prior  estimate  is  an  estimate  of  the  whole  function  associated  with 
the  data,  including  any  noise  background  component,  while  for  nonlinear  methods  (to 
be  used  mainly  for  high-resolution  reconstruction  of  spiky  objects)  the  prior  should 
estimate  the  smooth  component  only;  linear  methods  such  as  superresolution  become 
unstable  when  the  prior  estimates  only  the  support-limited  object  and  ignores  any 
noise  background,  while  nonlinear  methods  that  employ  a  uniform  prior  estimate, 
such  as  Burg’s  maximum  entropy  [1],  become  unstable  when  the  background  is  non- 
uniform. 

In  reference  [1]  we  presented  methods  for  the  reconstruction  of  ID  objects  from 
limited  ft  data.  Here  we  extend  these  methods  to  2D  objects  and  present  a  unified 
interpretation  of  both  cases  in  terms  of  the  finite-impulse-response  approximation  to  a 
Wiener  filter;  in  this  way  we  are  led  naturally  to  the  particular  Hilbert  spaces  used 
earlier  [1],  where  they  may  have  seemed  somewhat  ad  hoc. 

It  is  important  to  note  that,  while  the  Wiener  fiber  and  its  finite-impulse-response 
approximations  are  used  to  motivate  the  reconstruction  methods  presented  here,  we 
do  not  employ  a  statistical  model  for  the  functions  being  reconstructed. 

Throughout  the  paper  we  denote  by  /(a,  b)  a  non-negative  function  supported  on 
the  square  |o|=£.t,  |b|s£;r.  The  Fourier  series  representation  for  the  function/ on  |o|, 
is 

at  at 

/(o,6)=  ^  F(m,  n)  exp(ima  +  inb)  (1.1) 


We  assume  that  we  have  the  data  F(m,  n)  for  from  which  we  are  to 

reconstruct  (estimate)  f(a.  b).  A  commonly  used  estimate  is  the  truncated  Fourier 
series  (also  sometimes  referred  to  as  the  ‘discrete  Fourier  transform’  because  the 
summation  replaces  the  integration);  for  |o|,  define  the  dft(u.  b)  to  be 

H  v 

dft(<7,  b)  =  ^  F(m ,  n)  exp(i ma  -I-  inb).  (1.2) 

-«  -v 

Note  that  the  dft  is  defined  here  as  a  function  of  two  continuous  variables;  one 
sometimes  sees  ‘dft’  used  to  denote  a  sampled  version  of  (1.2). 


,v.v 
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In  many  applications  the  orr  will  be  unsatisfactory,  particularly  if  the  function  /is 
supported  on  a  smaller  interval  within  [  -  rr,  a J  x  ( -  ,-r,  ,t],  or  if  /  is  a  spiky  function 
and  the  number  of  data  values  is  not  large.  The  dft  is  consistent  with  the  original 
data,  in  the  sense  that  the  Fourier  scries  of  dft  (a.  h)  has  the  data  values  in  the  proper 
positions,  but  may  fail  to  he  non-negative  or  to  resolve  closely  spaced  peaks.  The 
objective  of  high-resolution  processing  is  to  employ  prior  information  to  obtain  better 
reconstructions  than  the  dft. 

For  completeness  we  discuss  the  Wiener  filter  and  its  approximations,  for  the  ID 
case  (for  notational  simplicity),  and  then  discuss  the  use  of  Wiener  filter  approxima¬ 
tion  for  the  reconstruction  of  ID  functions.  We  then  turn  to  the  2D  case,  the  main 
differences  stemming  from  difficulties  in  extending  the  concept  of  ‘causal  filter’. 
Finally ,  we  discuss  briefly  the  connections  between  these  methods  and  those  based  on 
the  minimisation  of  cross-entropy. 

2.  Wiener  filtering:  the  one-dimensional  case 

The  Wiener  filter  [2]  is  a  procedure  designed  to  produce  as  output  an  estimate  of 
‘signal’  when  presented  with  input  ‘signal  plus  noise’.  Assume  that  (s(;i)},  {»(«)}  are 
independent,  mean-zero  stationary  random  sequences  with  autocorrelation  functions 
c„(m),  r„„(m)  and  power  spectra  R„(a),  Ruu(a),  respectively,  with  |o|^.t;  the  sequence 
{r„(/;i)}  are  the  Fourier  coefficients  of  /?„(a),  and  similarly  for  R„u(a).  The  Wiener 
filter  is  a  doubly  infinite  sequence  (h(k)}  designed  as  follows:  given  the  random 
sequence  x(n)  =  s(n)  +  u(n)  as  input  and  y(n)  as  output,  where 


y(n)=^h(k)x(n-k) 


select  {h(k)}  so  as  to  minimise  the  expected  mean  square  error,  £|s(/r)- y(/;)|z.  The 
well  known  result  is  that  the  optimal  choice  of  sequence  {h(k)}  is  the  sequence  of 
Fourier  coefficients  of  the  function  H{a)  =  Ri,(a)/R„(a),  where  /?4J(a)  =  R„(a)  + 
Ruu(a),  and  H(a)  is  defined  to  be  zero  if  /?.t,(cr)  —  0. 

The  Wiener  filter  is  not  a  causal  filter,  since  we  do  not  have  h(k)  —  0  for  k< 0.  We 
can  ask  for  the  causal  filter  {g{k)}  ( g(k )  =  0,  A<0)  that  best  approximates  the  Wiener 
filter,  or,  going  further,  the  finite-impulse-response  filter  (d(k)}  (d(k)  =  0  unless 
K^k^L)  that  best  approximates  the  Wiener  filter.  To  obtain  these  optimal  approxi¬ 
mations  we  minimise  the  expected  mean  square  difference  between  the  outputs  of  the 
Wiener  filter  and  the  approximation.  These  optimisation  problems  are  equivalent  to 
best  approximations  in  a  Hilbert  space  with  weighted  inner  product. 

To  obtain  the  best  causal  approximation  to  the  Wiener  filter  we  minimise  the 
distance 


/y(Q)  “  2  S(k)  cxp(iAa)  R„(a)  dct 


over  all  causal  sequences  (g(A)}.  Similarly,  to  obtain  the  optimal  finite-impulse- 
response  filter  with  support  K^k^L  we  minimize  the  error 


I  H{a)~  ^  d{k)  exp(iAr)  /?„  (c)  da 
J  --•*  I  l-A  I 
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over  all  finite  sequences  {d(k)).  From  the  orthogonality  principle  in  Hilbert  space  [3] 
it  follows  that  the  optima!  {#(£)}  and  {d(k)}  must  satisfy  the  following  systems  of 
linear  equations: 


* 

^  g(k)  r.d"i  ~  k)  mZtO  (2.4) 

A  *0 
L 

r„(»i)  =  ^  d(k)  rt,(m  —  k)  k^m^L  (2.5) 

Equations  (2.4)  are  the  discrete  Wiener-Hopf  equations.  Having  solved  these  equa¬ 
tions  we  write,  for 

«  L 

C(a)=  ^  g(k)exp(iAa)  DL(a)  =  ^  d(k)  exp(iA:a)  (2.6) 

k-<»  l«K 


It  is  worth  noting  that  the  best  finite  impulse-response  filter  approximating  {g(k)) , 
for  O^K^k^L,  is  the  {d(k}}  above;  that  is,  this  choice  of  d(k)  minimises  the 
approximation  error 


G{a)~  2  d(k)  exp(iAa) 

k-K 


(2.7) 


viewed  as  a  function  of  the  d(k).  Therefore,  the  function  Dt(a)  is  simultaneously  the 
best  approximation,  of  its  form,  of  H{a)  and  of  G(a),  in  the  Hilbert  space  with  inner 
product  weighted  by  /?„(a). 

The  error  (2.3)  is  of  interest  for  the  reconstruction  problem  because,  for  the  case 
O^SK,  a  non-negative  function  is  being  approximated  by  a  necessarily  non-real 

trigonometric  polynomial,  in  a  Hilbert  space  with  weighted  inner  product.  As  shown 
in  reference  [1],  this  is  precisely  what  happens  in  the  maximum  entropy  method  (mem) 
of  Burg  [4],  that  the  finite  polynomial  also  approximates  G(a)  is  implicit  in  the  mem  in 
the  spectral  factorisation  (5). 

In  the  next  section  we  employ  these  approximation  theoretic  aspects  of  Wiener 
filter  design  to  obtain  reconstruction  methods. 


3.  Wiener  filter  approximation  and  reconstruction:  ID  case 

We  consider  the  problem  of  reconstructing  the  non-negative  function  f(a ),  |a|^7. 
from  finitely  many  values  of  its  Fourier  coefficients,  F(m),  We  present  first 

linear  methods  and  then  nonlinear  ones. 

3.1.  Linear  methods 

Assume  that  we  have  a  prior  estimate  of  the  broad  features  of  f{a),  in  the  form  of  a 
non-negative  function  p(a ),  such  that  p{a)  =  0  only  if/(a)  =  0;  of  course,  in  practice  we 
will  not  know  where  the  support  of  /  is,  exactly,  so  p(a)  should  be  positive 
everywhere.  A  rough  idea  of  the  support  of  f  can  be  indicated  by  concentrating  p  in 
that  region.  Let  p  play  the  role  of  f  the  role  of  /?„;  we  are  effectively  assuming 
that,  for  some  e>0,  we  have  p(a)^ef(a)  for  all  a,  and  that  (apart  from  the  scaling)  p(a) 


WV.V'.V.V.W.V 


Reconstruction  as  a  Wiener  filter  approximation  5  ! 

overestimates  f(a)  everywhere;  the  settle  factor  cancels  in  the  end.  so  is  not  needed. 
Then  H—flp  is  approximated  by  the  polynomial  D'_'A,  and  the  equations  that  must  be 
solved  are 


F{m)=  ^  d(k)P{m-k)  (3.1) 

L..-M 

where  P{m)  are  the  Fourier  coefficients  of  p{a).  These  equations  arise  when  we 
minimise  the  approximation  error 


f: t  « 

f(a)-p(a)  ^  d(k)  cxp(tka)  p{a)’'  do 

J  -a  1  -  —  A/ 


as  a  function  of  the  d(A).  The  resulting  estimator  of  f(a)  is  the  pdft  [6],  so  called 
because  of  its  form: 

PDFT(<7)=p(tf)Z)i\,(tf).  (3.3) 

If  the  prior  estimate  p(a)  =  constant,  then  d(k)  =  F(k)  and  the  pdft  reduces  to 

the  dpt. 

If  the  data  are  ovcrsampled  relative  to  the  actual  support  of  f(a)  then  including 
information  about  this  support  in  the  p{a)  can  result  in  significant  improvement,  so 
long  as  regularisation  to  avoid  sensitivity  to  noise  is  used  [1],  Note  that,  although  the 
pdft  is  not  necessarily  non-negative,  it  is  data  consistent;  it  extrapolates  values  of 
F{m)  beyond  the  data  window. 

3.2.  Nonlinear  methods 

We  assume  now  that  f(a)  consists  of  two  components,  a  discrete  (delta  functions) 
component,  which  is  the  object  of  interest,  and  a  background  (continuous)  compo¬ 
nent.  about  which  we  have  some  prior  information;  let  p(a)  be  our  non-negative  prior 
estimate  of  the  background  component.  Letting  p(a)  play  the  role  of  /?„  and  f(a)  the 
role  of  /?„,  we  see  that  H  =  plf;  for  the  filter  function  £>*  to  remove  from  Rrx  =  /  the 
component  associated  with  Rw,=f-p  it  must  place  nulls  near  the  values  of  the 
support  of  the  discrete  component.  We  perform  the  calculations  to  obtain  the  D £  and 
then  examine  the  nulls.  We  have  some  freedom  in  the  choice  of  the  K  and  L;  two 
choices  are  of  particular  importance;  (i)  A//2^L  -  —  K;  (ii)  L  =  M.  K  =  0.  The  first  has 
as  a  special  case  the  symmetric  linear  predictor  (slp)  [7,  8],  w  hile  the  second  includes 
Burg's  MEM 

In  case  (i)  we  solve  the  equations 

L 

P(m)=  Y,  d{k)F{m-k)  \m\aL  =  M/2  (A/even)  (3.4) 

k--L 

and  use  the  fact  that  DiL  approximates  H  =  plf  to  obtain,  as  the  estimate  of  /,  the 
centred  inverted  pdft  (cirDFr): 

cipdft(o)=p(o)/D!ii.(u).  (3.5) 

If  the  prior  p{a)  =  constant,  then  (3.5)  becomes  the  symmetric  linear  predictive 
method  of  Johnson  [7],  Note  that  if  the  support  of  /is  properly  contained  within  the 
support  of  p  then,  in  order  to  obtain  the  d(k)  that  are  optimal  for  that  /  and  p  it  is 
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necessary  to  replace  P{in)  in  (3.4)  by  the  corresponding  Fourier  coefficient  of  the 
function  that  is  p(a)  on  the  support  of  / and  zero  otherwise.  In  practice  one  does  not 
know  the  support  of /;  our  point  is  rather  that  (3.4)  docs  not  provide  the  optimal  d(k) 
for  such  pairs  p  and  /. 

The  role  of  the  prior  p(a),  in  the  nonlinear  methods,  is  to  reduce  bias  in  the 
estimate  of  peak  locations;  if  we  estimate  the  background  component  badly  then  the 
filter,  as  it  tries  to  eliminate  the  f(a)  —  p(a)  features,  must  null  out  (true  back¬ 
ground  - p)  as  well  as  the  discrete  component.  With  limited  freedom  to  place  nulls, 
bias  is  unavoidable.  This  has  been  shown  to  be  a  problem  with  mem.  when  used  on 
oversampled  data  [1],  and  is  due  to  the  assumption,  implicit  in  mem.  that  the 
background  is  constant  over  [  -  nr,  .t).  We  consider  mem  next,  as  a  special  case  of  ( ii). 

For  case  (ii)  we  have  K  =  0.  L  =  M  and  we  solve  equations 

At 

P(m)  =  2  d(k)F(m-l)  (3.6) 

i  >11 

to  obtain  the  filter  function  D,V;  we  view  this  function  now  as  an  approximation  of  G, 
not  of  H=plf.  The  discrete  Wiener-Hopf  equations  (2.4)  are  equivalent  to  the 
statement  (/?„)♦  =  (/?„<j)*,  where  by  (#„).,  we  mean  the  causal  part  of  the  Fourier 
series 

K 

(KuK(a)=  ^  ^(m) exp(iwa)  |a|=£;r  (3.7) 

m  •  0 

and  similarly  for  other  functions.  Equations  (3.6)  tell  us  that  the  two  causal  functions 
p+  and  (/D,V)+  have  identical  Fourier  coefficients,  out  to  index  in  =  M.  Because  D”  is 
a  finite  polynomial  we  can  rewrite  /£)”)*  as  (fD").  =/„£)"  +  /..  where  j.  is  a  finite 
causal  polynomial  involving  only  known  values: 

Af-  \ 

/»=  2 

m  •  I 

From  p  +~(fDu)  +  =  f.D"  +  j\  we  obtain  an  estimate  q  of /„; 

q{a)  =  (p.{a)-j^))ID!!{a)-  (3.9) 

from  /  =  2Re(/+)  -  F( 0)  we  obtain  the  inverse  pdft  (ipdft)  estimate  of  /  itself: 

ipdft(o)  =  2Rc(<7(<7))  -  F(0).  (3.10) 

Consider  the  complex  polynomial  D{z)  =  rf(0)  +  d{  1)2  +  ...  +  d(,\f)z If  the  roots 
of  D(z)  are  outside  the  unit  circle  (the  minimum  phase,  or  mp  property)  then  II D',' (a) 
is  also  causal  and  so  is  q(a).  It  can  be  shown  easily  that,  if  l/£>"(a)  is  causal,  then 
ipdft(<7)  is  data  consistent,  although  it  may  not  be  non-negative.  Although  it  is  not 
always  the  case  that  D{z)  has  the  mp  property,  it  is  frequently  the  case  in  practice,  and 
the  pdft  is  usually  data  consistent.  If  the  prior  p(a)  =  constant,  then  the  ipdft  reduces 
to  Burg’s  mem,  the  D(z)  has  the  mp  and  the  mem  is  data  consistent  (as  well  as  non¬ 
negative). 

As  remarked  earlier,  the  mem  has  been  observed  to  perform  poorly  when  the 
Junction /is  concentrated  in  a  smaller  region  of  [-,t,  t];  this  is  because  the  p(a)  is  a 
constant,  while  the  background  is  not  evenly  distributed  over  all  of  [  - .t,  .t].  Because 
the  ipdft  is  free  to  take  on  negative  values  it  could  be  used  to  gauge  the  accuracy  of 
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the  prior  being  used;  significant  negative  values  should  indicate,  that  our  p{a)  is  not 
accurate.  We  have  not  obtained  a  quantitative  measure  of  the  significance  of  negative 
values,  however. 

We  consider  now'  the  extensions  of  these  methods  to  the  two-dimensional  case. 
Although  much  remains  essentially  the  same  the  absence  of  an  obvious  generalisation 
of  the  notion  of  causality  affects  the  extension  of  the  ipdft. 

4.  The  two-dimensional  case 

As  in  the  one-dimensional  case,  the  power  spectrum  of  the  input.  7?,,(a,  /9).  is  the  sum 
of  two  components,  /?„( a.  /?)  =  /?)  +  /3).  and  the  Wiener  filter  is  the 

doubly  indexed  sequence  {h(j.  k )}  of  Fourier  coefficients  of  the  function  H{a, 
P)  =  R„{a<  /?)//?.., (a.  /?)•  To  obtain  a  finite-impulse-response  approximation  to  the 
Weiner  filter  we  minimise  the  following  error  of  approximation,  as  a  function  of  the 
d(j,  k): 

\H{a,B)-DiMa,fi)\2Ru(a,P)<ia6P  (4.1) 

-.7 

where 

J  L 

&I.K  (a.  P)  =  X  X  d^' k )  exp(i/a  +  /Av3).  (4.2) 

jm\  ImK 

In  the  two-dimensional  case  the  problem  is  to  reconstruct  the  non-negative 
function  f(a,  b ),  |a|,  |7>|^;r,  from  finitely  many  values  of  its  Fourier  coefficients,  F(m, 
n),  |m|<A7,  |/i|s£iV.  As  in  the  one-dimensional  case  we  consider  estimates  of  fa,  b) 
obtained  by  analogy  with  the  problem  of  approximating  the  Wiener  filter. 

4.1.  Linear  methods 

Assume  that  a  prior  estimate  of  the  general  shape  of  f(a ,  b)  is  given  by  the  positive 
function  p(a,  b),  and  that  {P(m,  n)}  are  its  Fourier  coefficients.  As  before,  we  let  p 
play  the  role  of  R /  the  role  of  /?„,  so  that  the  Wiener  filter  is  H  =  fp. 

For  fixed  7,  J,  K,  L  the  optimal  finite-impulse-response  filter  function  is  D{a. 
0)  =  DJij-(a ,  /?)  where  the  coefficients  of  D  satisfy  the  equations 

J  L 

F(m,  n)  =  ^  y'  d{j.  k)  P(m  -j,  n  -  k)  |m|^A7,  |/t|^A'.  (4.3) 

/-I  t-K 

Having  found  the  d(j,  k)  we  use  the  fact  that  D  approximates  H 
estimate  of  /: 

pdft(<7,  b)  =  p(a,  b)  D(a,  b). 

The  obvious  choices  for  7.  /,  K,  L  are  -7  =  7  =  47,  -  K  =  L  =  TV. 

4.2.  Nonlinear  methods 

We  assume  now  that  fa.  b)  has  a  discrete  component  of  interest,  as  well  as  a 
background  component  estimated  by  the  non-negative  function  p(a ,  b ).  As  in  the  ID 


= fp  to  obtain  our 
(4.4) 
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case  we  let  p  play  the  role  of  /?„,/ the  role  of  /?„.  so  that  :he  approximate  Wiener  filter 
attempts  to  null  out  the  discrete  component.  If  the  support  of  /  contains  the  support  of 
p  then  the  equations  to  be  solved  for  the  optimal  finite-impulse-response  filter  {d(j, 
k)}  are 


P(m.  n)  —  ^  d(j,  k)  F(m  —j,  it  —j)  i^nv^J,  K^n^L.  (4.5) 


/-I  l-K 


The  choices  for  I,  J,  K,  L  will  be  restricted  by  the  available  data,  since  the  data  make 
up  the  entries  of  the  matrix  that  appears  in  the  system  of  equations  to  be  solved.  We 
consider  here  two  possibilities. 

(i)  Let  J  =  - 1  =  Ml 2.  L-  —  K=  M/2  (M.  M  even).  Solving  (4.4)  for  the  d(j,  k)  we 
view  D  =  DJ,  *  as  an  approximation  of  H  =  plf,  so  that  our  estimate  of  /  is  the 
two-dimensional  version  of  (3.5): 

cipdft  (a.  b)  =  p(a,  b)/D(a,  b).  (4.6) 

(ii)  Let  1=  K  =  0,  J  =  M,  L  =  N.  Then  D  can  be  viewed  as  an  estimate  of  the 
first-quadrant-indexed  component  of  H,  which  we  denote  by  With  the  first- 
quadrant-indexed  component  of  p(a,  b)  given  by 


p(a,  b)++  =  ^  ^  P(m,  n)  exp(i ma  +  inb)  (4.7) 

equations  (4.4)  state  that  p(a,  b)+  +  and  [f(a.  b)D(a ,  b)]**  have  the  same  Fourier 
coefficients,  for  indices  O^m^M,  As  in  the  ID  case,  we  can  write  [/(a, 

b)D(a,  b)]„+=/(<i,  b)++D(fl,  b)+j(a,  b)  +  +  ,  where  j(a,  b)**  is  a  firs' -quadrant- 
indexed  function  that  involves  only  known  values.  Our  estimate  of  f(a ,  b  ,+  is  then 
q(a,  b)  =  [p(a,  b)*„-y(a,  b)  +  .]ID(a,  b),  which  may  not  itself  be  first-  ]  ladrant- 
indexed.  Repeating  this  procedure  three  more  times,  for  each  quadrant,  we  estimate 
/(a,  b)  by  summing  the  four  estimates  so  obtained,  taking  care  to  subtract  con.  "'onents 
included  in  more  than  one  estimator.  The  resulting  estimator  we  call  the  ipdi  r. 


5.  Relation  to  other  methods  j 

The  reconstruction  problem  considered  here  is  to  obtain  the  function  /(a,  b)  from  .he 
values  F{nt,  n),  |m|^Af,  \n\N,  where 

F(m,/i)  =  J  J  f(a,  b)  exp[  -  (ima  +  inb)]  da  db/4;r;  (5.1) 

that  is,  we  are  attempting  to  solve  the  integral  equation.  The  survey  paper  by  Frieden 
[9]  describes  a  number  of  approaches  to  this  problem. 

When  the  p(a,  b)  is  chosen  to  incorporate  support  information,  so  that  p{a,  b)  =  1, 
|fl|<A<-r,  |b|<fl<,7,  and  p(a,  b)  =  0  elsewhere,  a  small  amount  of  noise  in  the  data 
can  cause  degradation  of  the  pdft  estimator.  It  is  safer  to  make  p(a,  b)  =  c>0,  instead  ■ 
of  p(a,  b)  =  0.  This  is  a  form  of  regularisation  and  is  in  keeping  with  the  requirement 
that  the  support  of  p  be  no  smaller  than  that  of  /.  Since  the  pdft  performs  an  • 
approximation  of  the  function /in  the  (a,  b)  domain  and  smooths  the  effects  of  noise 
(if  regularised),  it  resembles  the  methods  of  Phillips  [10]  and  Twomey  [11].  The  main 
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differences  are  that  the  pdft  retains  the  continuous  formulation,  rather  than  discretis- 
ing  /(a,  b ),  and  employs  a  prior  estimate  of  the  function  /. 

Jt  might  appear  that  the  pdft  is  related  to  the  Heistrom-Weiner  ‘sharpness- 
constrained'  method  [12],  The  latter  is  based,  however,  on  a  statistical,  or  ensemble, 
model  for  the  restoration  problem  and  employs  power  spectra  of  /  and  /?;  the  mean 
squared  error  is  calculated  in  the  usual  L'  norm,  rather  than  with  a  weighted  norm  in 
(<j,  b )  space.  In  the  approach  presented  here  we  do  not  postulate  the  existence  of  an 
ensemble  of  object  functions  /  to  be  restored  and  the  idea  of  Wiener  filtering  is 
introduced  only  to  borrow  the  weighted  error  criterion  used  for  approximating  non¬ 
negative  functions  by  polynomials.  The  Backus-Gilbert  [13]  method  is  similar  in 
philosophy  to  the  Heistrom-Weiner  approach  but  there  is  only  a  superficial  connec¬ 
tion  to  the  estimators  presented  here. 

Because  the  finite  data  are  typically  insufficient  to  determine  a  single,  unique 
solution  to  the  reconstruction  problem,  one  is  faced  with  the  task  of  selecting,  from 
among  the  many  possibilities,  one  particular  answer.  The  general  feeling,  which  we 
share,  is  that  the  selection  should  not  be  arbitrary  but  should  be  guided  by  some 
reasonable  principles  of  inference.  At  this  point  there  is  some  disagreement  concern¬ 
ing  which  principles  of  inference  are  to  be  called  reasonable.  In  an  attempt  to  resolve 
the  situation  Shore  and  Johnson  [14]  developed  an  axiomatic  basis  for  the  principle  of 
cross-entropy  minimisation  and  Jones  [15]  has  recently  provided  an  approximation- 
theoretic  argument  for  the  same  method. 

Among  all  functions  g(o,  fc)>0  consistent  with  our  data  we  could  select  the  one  for 
which  the  Shannon  entropy 


entropy(g) 


b)  log  g(a,  b)  dadb 


is  maximised.  Generally,  there  is  no  closed-form  solution  and  iterative  procedures  are 
employed. 

If  there  is  available  a  prior  estimate,  p(a,  fi).  of  /(a,  b)  then  (5.2)  is  replaced  by  the 
cross-entropy  of  g,  given  p: 


cross-entropy(g|/7) 


b)  !og[g(a,  b)!p{a,  b)]  dadb  (5.3) 


The  method  of  ‘minimisation  of  cross-entropy’  (mce)  has  us  select,  as  the  estimate  of 
f(a.  b),  that  data-consistent  g(o,  b)> 0  for  which  the  integral  in  (5.3)  is  minimised.  The 
optimal  solution  then  has  the  form 


mce  (a,  b)  =  p(a,  b)  exp 


(M  A' 

2  I  <«. 

jrn-m  t--  V 


k)  exp(i;a  +  ibk) 


If  p(a,  b)  is  a  good  prior  estimate  then  the  sum  in  the  exponential  term  will  be  near 
zero;  approximating  exp(x)  by  1  +x  leads  to  an  estimator  of  the  pdft  form.  If  it  is 
known  that  the  function  f(a.  b)  is  spiky,  then  the  sum  in  the  exponential  term  will  have 
significant  negative  values.  If  we  estimate  exp(i)  by  1/(1  -x),  then  this  is  better  than 
1  +x  for  negative  x\  making  this  approximation  in  (5.4)  leads  to  the  cipdft  form. 

When  the  p(a,  b)  is  constant  over  the  support  of  the  object  function  f(a,  b)  the 
pdft  (4.4)  provides  a  minimum  energy  extrapolation  of  the  data,  consistent  with  the 
support  constraint.  In  reference  [16]  we  considered  the  problem  of  reconstructing/^. 
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b)  from  only  the  magnitude  data,  \f{m,  »)|,  !/h|«AY,  |/t|«/V,  that  is,  the  phase  retrieval 
problem.  When  arbitrary  phases  are  assigned  to  the  magnitude  data  and  the  pdft 
energy  calculated  one  finds  the  energy  to  be  dependent  on  that  choice  of  phases  and 
therefore  to  provide  a  useful  cost  function  to  direct  the  search  for  the  correct  phases. 


6,  Conclusions 


In  this  paper  we  have  considered  the  reconstruction  of  a  non-negative  function  from 
finitely  many  values  of  its  Fourier  transform.  We  have  extended  to  the  2D  case 
methods  previously  presented  for  ID  reconstruction  [1]  and  obtained  a  new  derivation 
of  these  estimators  based  on  analogy  with  the  design  of  approximate  Wiener  filters,  in 
which  the  object  function  to  be  reconstructed  and  our  prior  estimate  play  the  roles  of 
input  and  output  power  spectra.  To  obtain  linear  estimators  we  let  our  prior  estimate 
play  the  role  of  the  input  power  spectrum,  allowing  the  filter  to  extract  those  features 
not  found  in  the  true  object.  To  obtain  nonlinear  estimators  for  spiky  objects  with 
continuous  backgrounds  we  estimate  the  background  function,  and  then  let  it  play  the 
role  of  the  output  power  spectrum;  the  true  object  function  then  plays  the  role  of  the 
input  power  spectrum,  so  that  the  filter  attempts  to  null  out  the  discrete  component. 

The  linear  methods  are  extrapolation  procedures  that  are  particularly  useful  when 
the  data  are  oversampled.  The  nonlinear  methods  generalise  the  Burg  maximum 
entropy  method,  for  the  ID  case,  and  provide  computationally  inexpensive  2D 
approximations  to  other  entropy-based  methods. 

The  methods  presented  here  are  derived  using  the  best  approximation  in  weighted 
Hilbert  spaces;  the  linear  equations  to  be  solved  in  each  case  are  the  normal  equations 
that  arise  from  such  a  best  approximation  and  we  know  what  is  being  approximated  by 
what  in  each  case. 
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ABSTRACT 

Data  adaptive  methods,  such  as  Capon’s  maximum  likelihood  (ML) 
method,  suppress  sidelobe  structure  and  reduce  ambiguity  in  source 
parameter  estimation  because  they  are  optimized  against  unwanted 
terms  actually  present  in  the  data,  rather  than  against  an  a  priori  mode! 
of  what  could  be  present.  When  the  noise  component  resembles  potential 
source  terms  and  produces  a  reduced  rank  cross-sensor  correlation 
matrix  in  the  noise-only  case  methods  such  as  ML  can  become  unstable. 
By  employing  a  "reduced  rank"  111  estimator  we  can  avoid  this 
instability.  This  method  is  analogous  to  the  "sector-focused  stability" 
method  recently  developed  by  Byrne  and  Steele,  and  is  derived  by 
considering  the  general  problem  of  suppressing  ambiguity  in  parameter 
estimation. 


The  generalized  sidelobe  problem  and  optimal  suppression 

A  problem  that  arises  often  in  applications  is  the  following  one. 
stated  here  in  general  terrns: 

The  matching  problem:  Let  6  be  a  family  of  (possibly  vector) 
parameters  0.  and  let  P={q(6)  |  0  in  0}  be  a  family  of  pairwise  linearly 
independent  N-dimensional  vectors  parametrized  by  the  0  in  ©.  Our  data 
consists  of  the  single  vector  p=p(0q).  taken  from  P  and  the  problem  is 

to  determine  the  value  0=0g. 


if  the  members  of  P  are  quite  distinct  and  0  is  a  small  finite  set 
then  the  problem  is  easily  solved  by  inspection.  More  commonly,  the  set 
0  is  a  continuum,  the  vector  function  p(0)  continuous  in  0,  and  the  data 
vector  a  noisy  version  of  p(e0).  The  usual  approach  in  such  cases  is  to 

perform  linear  filtering,  such  as  simple  matching  via  the  dot  product. 

3nd  base  the  decision  on  the  outcomes  of  the  filtering. 

Linear  filtering  solution:  Consider  each  member  0  of  0  in  turn,  hold  0 
fixed  and  select  a  linear  filter  f-f(e),  a  N-dimensional  vector  with 
entries  dependent  on  0.  Let  y(©)=  j  f+£j  *■-  f+2£*f  be  tne  (magnitude 
squared  of  the)  filter  output.  We  know  from  Cauchy's  inequality  that 
u(0H({/ p)(f* f).  witn  equality  if  and  oniy  if  f  -up  .  for  some  scalar  <x: 
from  the  function  y(0)  we  can  then  determine  G n. 

Example:  For  each  9  let  f(9)=  p(9)  /  /(p(0)+E>(0))  .  Then  y(9)<  with 
equality  if  and  only  if  9=9g.  This  method  we  shall  call  simple  matching: 

the  graph  of  the  function  y(6)  is  usually  called  the  ambiouitu  surface. 

If,  in  the  simple  matching  approach,  the  output  y(0j)  is  large  for  a 
value  of  0)x0q.  we  say  that  there  is  a  sizable  sidelobe  at  0j.  for  the 
true  value  8q.  We  shall  describe  two  general  procedures  for  reducing  the 

sidelobe  effect,  the  first  a  linear  method  that  does  not  depend  on  tne 
actual  data  obtained,  and  the  second  3  data  adaptive  method  analogous  to 
Capon's  maximum  likelihood  method  (Ml  M)  for  spectrum  estimation. 
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A  linear  filtering  method  to  optimally  reduce  sidelobes 

Denote  by  y(©;0o )  the  magnitude  squared  of  the  filter  output  f(9)*£ 
corresponding  to  fixed  value  9  and  true  value  Gq.  For  any  particular 
problem  we  do  not  know  the  value  of  6q,  which  could  be  any  member  of 
0.  For  those  9*9q  we  want  y(0;©Q)  to  be  small;  since  we  do  not  know  6q 
let  us  make  the  average  value  of  y(0;6Q)  small,  as  6q  ranges  over  the 
various  members  of  6.  Specifically,  for  fixed  0,  select  that  filter  f(0) 
for  which  jy(0;0o)/p(0Q)*p(eQ)  d0Q  is  minimized,  subject  to  the 

constraint  y(0;0)=l;  the  precise  meaning  of  the  integral  will  be  dear 
from  the  context.  We  can  formulate  this  in  matrix  language  as  fo!!ows: 

Minimize  f*Af  ,  subject  to  f+p(@)=l,  where  A  is  the  N  by  N  matrix 
with  entries  Am  n=  f  £>(©o)rn  p(0o)*n  deo  •  and  subscripts 

denote  the  particular  entry  of  the  matrix  or  vector.  Using  the  normalized 
vectors  u(0o)-  E>(e0)/vf(p(e0)*p(e0))  we  C3n  write  A=  Ju(e0)u(e0)+  d90. 

The  optimal  filter  is  then  1^(0)=  X(e)A"!gfe)  .  where  the  parameter 

>v(0)=  I/e>(6)+A”  'p(e)  ,  provided  that  A  is  invertible.  The  value 
0PT(e)=|lOpt(e)+a| ^  =  |a(e)+A~ip.|2/|p(6rA~,p(e)|2  is  then  the 

function  we  want  to  use  to  compute  the  optimized  ambiguity  surface. 

If  the  matrix  A  is  not  invertible  then  pseudo- inversion  is  used  to  obtain 
the  optimal  filter;  this  will  be  the  case  in  the  normal-mode  situation 
considered  below. 

For  f  ixed  0  the  optimal  f liter  f(0)  operates  on  the  data  vector  p  and 
we  want  the  value  j  f(0)*tj^  to  be  small  if  0*0g.  The  optimal  filter  is 
designed  to  make  this  value  sm3ll,  on  average,  but  the  actual  data  we 
have  is  a  particular  £>(0O);  we  do  not  really  care  if  jft0)*p(9|)|  2  is 
large  for  other  values  0|s0q.  since  f(9)  does  not  nave  to  operate  on 
p!,0|).  Data  adaptive  methods,  such  as  Capon's  mil.  optimize  the  filter 

against  what  is  actually  there  in  the  data,  not  against  a  ciass  of 
potential,  but  mostly  not  actual,  data  vectors.  We  consider  next  me 
extension  to  our  genera!  problem  of  data  adaptive  approach  of  Capon 
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In  practice  the  data  may  consist  of  several  p.  each  a  noisy  version 
of  p(Gq).  Of  interest  then  is  the  average  over  all  the  p  of  the  value  of 

the  output  y (©:©q );  average  output  =  <ji(0)*pp>  =  f(e)+<pp+>f(e)  = 

f(6)+Rf(G).  where  <  >  denotes  averaging  over  the  available  p  and  R  is  the 
matrix  R=<pp*>.  For  fixed  0  let  us  find  that  filter  f(0)  for  which  this 
averaged  output.  f(0)*Rf(e).  is  minimized,  subiect  to  the  constraint 
1(9)* p(9)=l.  This  "maximum  likeliho od“  solution  is  easily  seen  to  be 
Lph ;(©)=  X(e)R'1p(e).  where  the  parameter  is  X(0)  =  l/p(9)+R~  ’p(0).  The 

averaged  output  is  then  111(0)=  l/p(0)+R' 'p(0);  here  it  is  assumed  that 
one  effect  of  averaging  and  of  noise  is  to  make  the  rank  of  R  equal  to  N. 
Note  that  OPT(9)  differs  from  ni(e)  in  that  A  is  replaced  by  R  in  ML;  it 
is  in  this  sense  that  ML  is  data  adaptive. 

The  HL  approach  will  generally  outperform  the  OPT  method  because 
it  can  employ  its  algebraic  freedom  to  reduce  sidelobe  effects  coming 
from  what  is  actually  present  in  the  data,  rather  than  to  guard  against 
potential  threats  that  are  most  likely  not  present. 

Because  loss  of  resolution  in  linear  estimation  is  a  particular  form 
of  sidelobe  problem  the  m  approach  typically  achieves  better  resolution 
than  linear  methods.  If  the  data  vector  is  a  superposition  of  two  or  more 
members  of  P  .  say  p=p(0o)  *  p(0|),  with  0q  and  9,  similar,  then  the 

simple  matching  method  may  result  in  |  L(0)*p J  ^  being  largest  at  0=0o. 

where  ©2  is  neitner  e0  nor  6,.  but  is  near  each.  We  could  say  that  in 

this  case  the  sidelobe  at  caused  by  Oj  is  added  to  that  caused  by  0g. 

resulting  in  a  maximum  between  the  two  correct  values.  Because  ML  can 
do  a  better  job  of  suppressing  these  sidelobes  it  can  resolve  when  the 
simple  matching,  or  even  the  OPT  method,  cannot. 

In  many  practical  situations  the  data  vector  includes  an  additive 
random  noise  component.  Depending  on  the  statistical  behavior  of  this 
noise  component  the  performance  of  the  111  method  can  vary 
considerably.  In  [  ]  we  discussed  the  instability  uf  the  ML  estimator  of 
bearing  of  planewave  source  fields,  in  the  presence  of  spatially 
concentrated  noise  and  systematic  phase  errors.  The  instability  arises 
when  the  noise  component  "iooks  like"  potential  signals  and  corresponds 
to  a  reduced  ram-  component  of  the  matrix  n. 
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In  the  next  section  we  shall  consider  the  OPT  and  HL  methods  for 
the  case  of  normal  mode  propagation  in  shallow  water;  here  also  the 
noise  component  can  lead  to  instability  of  the  ML  method  and  we  shall 
need  to  develop  more  stable  procedures. 


Normal-mode  propagation  and  signal  processing 


The  field  at  range  r=0  and  depth  z  excited  by  a  unit  amplitude  point 
source  at  range  r=r0  and  depth  z0  in  a  waveguide  is 


n 

P(z)  =  P( z;r0.z0)  =  mi  V  Sm(r0.z0)  Um(z).  ( 1 ) 

m=  I 


where  Urn(z)  is  the  rnth  modal  eigenfunction  of  the  depth-dependent 
boundary  value  problem,  and  Srn(rQ,ZQ)  is  the  modal  amplitude  value, 
given  by 

sm^r0'z0)=  exp(37Ti/^)exp(-3rr0Mkmr0)Um(20)7(27r/lcmr0) .  (2.) 

The  pressure  field  is  sampled  using  a  vertical  array  of  sensors,  at 
depths  zn,  n=l,...,N,  and  single  frequency  components  extracted  via  FFT  to 

produce  the  vector  p=(P(zj) . P(zN))T.  This  vector,  obtained  (ideally) 

from  sampling  the  signal-only  field,  can  be  written  as  p=Us,  where  U  is 
the  N  by  M  matrix  with  entries  Unm=Um(zn),  and  s  the  M  by  I  vector 

with  entries  sm=Sm(r0.20). 

in  the  case  of  a  high- loss  bottom  most  of  the  noise  reaching  the 
array  is  unaffected  bu  the  bottom,  so  contributes  to  p  a  vector  of  the 
form  .  whose  entries  are  (possibly  correlated)  random  variables.  If  the 
bottom  is  low-loss  then  noise  energy  can  excite  the  modal  structure  and 
contribute  to  p  a  component  of  the  form  U£.  where  2.  is  an  M  by  I  vector 
whose  entries  are  random  variables  representing  the  aggregate 
excitation  of  each  mode  by  the  superposition  of  noise  sources.  We  can 
therefore  write  the  data  vector  as  p  =  Us  +  U2.  •  u  =  Ux  +  .  with 

randomness  entering  through  the  3  and  the  tj.;  when  one  considers  the 
effects  of  rough  surface  scattering  on  the  signal  one  includes  random 
Phase  and  amplitude  modulations  of  the  entries  of  s.  The  matrix 


R  =  <P£*>  now  has  the  form 


R  =  U<ss*>U+  4  U<2f£4  >U+  -  <JUL*>  =  Uss'll*  4  UQU-*  4  G  .  (3) 

We  shall  assume  that  the  non-modal  noise  component.  G.  is  3  multiple  of 
the  identity  matrix,  representing  spatially  uncorrelated  noise,  such  as 
sensor  noise.  The  interesting  term  in  (3)  is  UQU\  the  modal  noise 
component  of  R. 

Typically  the  number  of  modes,  M.  will  be  less  than  the  number  of 
sensors,  N,  so  that  the  rank  of  the  matrix  UQU*  is  at  most  II.  Because 
the  signal  component,  Us,  and  the  noise  component,  U3.  both  lie  in  the  M- 
dimensional  linear  span  of  the  H  columns  of  U  the  noise  looks  like 
potential  signals  and  we  expect  ML  to  exhibit  the  sort  of  instability  we 
discussed  earlier.  We  shall  return  to  this  point  when  we  discuss  stable 
nonlinear  methods;  for  the  moment  we  consider  the  OPT  method  in  the 
context  of  normal  modes. 


Optimal  linear  processing  in  the  normal  mode  case:  Let  us  denote  by 
E>(Sq)=  the  vector  of  field  samples  we  would  obtain  in  the  case 

of  only  a  single  source  at  0g=  (tq.Zq).  Let  us  normalize  j>(9q)  to  get 

Uw0)=a(0o)/A'£(0Q)4£(0Q)).  The  A  that  represents  the  totality  of 

potential  source  vectors  is  now  A=|u(9q}u(6q)+  d6g  .  where  integration 

with  respect  to  9g  means  over  CKZgil-Uchannel  depth,  and  over  Oir0<«*. 

and  d0Q=rdQd^,  From  (I)  and  (2)  it  follows  that  the  matrix  A  has  the 

form  A  =  UBU+,  where  B  is  the  matrix  with  entries  Bm  k  Given  by 

5m.k  =  1  5m^r0-z0^'k^r0-z0^  |  l^rO*zoM  I  “  rOdrOdzO  •  ^ 

and  J  |c(r0.z0)|  |-V(p(rg.Zg)  p(rg,ZQ)). 


The  minimization  problem  to  be  solved  is:  Minimize  [Vd.  subject  to 
f  p(9)=  i .  This  is  equivalent  to:  Minimize  v4By,  subiect  to  v\sfr.z)=  1 . 
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where  v=lf  f  and  £i/.z)=Us(r,z).  The  optimal  v  is  v^pcxB"  's(r.z)  for 
o.=  l/s(r,z)*B‘  ’sfr.z)  .  The  optimal  f  liter  f^  is  found  as  follows: 
assume  that  f^p^Uw  for  some  w;  then  yop^u*f0pt_rU',Uw.  so  that 
w-(U*U)",yopt  and  £^=11(1/ U)"  ’v^.  The  optimal  estimator  is  then 

OPT(r,z)=<  |  f0pt*e.(r.z)  |  2>=c*:2s(r.z)*C*RCs(r,z).  (5) 

where  C=U(U+U)_,B']. 

N 

Note  that  the  entries  of  U*U  are  (U*lOm  .=  T  Um(zn)U](zn)  . 


Even  if  \  um(z)Uj(z)Hdz  =0  for  m=j.  it  may  not  happen  that  ^ 

2=0 

(U+U)m  j=0  .  for  m*j;  the  latter  involves  the  locations  of  the  sensors, 

whereas  the  former  does  not.  Because  the  integral  is  only  over  the  water 
column  and  does  not  include  the  bottom  it  rnay  not  be  zero  either,  as  in 
cases  such  as  the  Pekeris  model  in  which  0 <.z<*>  are  the  limits  on  z. 


fiode-f ilterina  as  an  approximation  to  the  optimal  processor:  We  obtain 


the  mode  filtering  procedure  discussed  in  Shangl  ]  by  making  several 
simplifying  assumptions  in  the  computation  of  the  matrix  B:  specifically, 
let  us  assume  that  |  | HvTq.Zq)  |  |  is  constant,  as  a  function  of  (Tq.Zq); 

and  that  the  functions  Um(ZQ)  are  orthogonal  over  the  interval  0<Zq<h. 
Then  the  entries  of  B  are  Brn  ,=o.  m*j.  Bm  m=  I /2£>mkriV  so  that  5  is  a 
diagonal  matrix.  If  we  assume,  in  addition,  that  the  values  3m  and  km  do 

not  vary  greatly  with  m.  then  B  is  (a  multiple  of)  the  identity  matrix  and 
so  C=U(U+U)"'.  The  OPT  estimator  is  then  approximated  by  the  mode 
filtering  estimator  (MFE) 

f1FE(r.z)=  c<2i(r.z)+(U+U)',U+RU(U*U)',s(r.z)  .  (“) 

with  rV-  s(r,z)*s(r.z). 
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The  maximum  likelihood  approach  m  the  normal -mode  case:  From  the 
matrix  R=<Q£>*>,  we  obtain  the  ML  estimate^ 

ni(r.z)=  1/  £(r,z)*R''p(r.z)  .  (8) 

Since  R=Uss*U*+  UQLf  *  el  ,  it  follows  that,  if  the  rank  of  o  is  hi,  the 
eigenvalues  Xn  of  R.  for  n=tl*l . N,  satisfy  Xnrs  and  that  the  associated 

eigenvectors  ^  have  the  property  that  iTx^O  ;  for  n=  1 . fl  Xn>e. 

Writing  R  =  XLX\  where  the  columns  of  X  are  the  orthonormal 
eigenvectors  x^,  and  L  is  the  diagonal  matrix  of  eigenvalues  of  R,  that  is 

l  =diag{X ] . XN),  we  can  rewrite  (8)  as 

N  N 

ML(r,z)=  1/  IXn“!  l^r.z/xj2  =  >  /  I  V 1  |  s(r.z)*u*xr||  2  .  (9) 

n= I  n=  I 

Because  of  the  reciprocal  weighting  by  Xn-1  those  terms  corresponding 

to  the  lowest  eigenvalues  contribute  most;  that  is,  the  sum  is 
essentially  over  n=M+l . N.  Because,  for  these  n.  =0  for 

(r q.Zq)  associated  with  the  true  source,  we  might  expect,  as  in  the  usual 

case  of  hll  estimation,  to  discover  the  value  of  (^.Zq)  by  evaluating 

ML(r.z)  and  looking  for  the  largest  value.  However,  each  of  the  terms  in 

(9)  corresponding  to  n=n+ 1 . N  is  zero,  for  all  (r.z).  It  follows  that  the 

I'lL  estimator  will  show  a  large  response  for  each  (r.z),  not  only  for  the 
(tq.Zq)  corresponding  to  the  actual  source.  The  resulting  ambiguity 

surface  will  be  essentially  uniformly  "white”. 

It  is  important  to  recall  that  this  failure  of  the  ML  estimator 
occurs  when  the  noise  component  is  essentially  modal,  that  is,  of  the 
form  UQU*.  as  would  be  the  case  for  a  low-loss  bottom;  if  there  is  no 
noise  component  of  this  form  then  the  ML  procedure  should  work  better. 
When  the  modal  noise  is  present  the  situation  is  analogous  to  that  of 
spatially  concentrated  planewave  noise  and  the  solution  we  shall 
consider  next  is  similar  to  the  "sector -focused  stability"  fr/S)  method 
given  in  [  i. 


Reduced-dimension  flL  method  for  the  normal-mode  case 


The  matrix  U  is  N  by  n  ana  induces  a  linear  transformation  from  the 
space  of  complex  M-dmnensiona!  vectors.  Cf1.  into  the  space  of  complex 
N-dimensional  vectors,  CN;  we  shall  assume  that  U*U  is  invertible,  which 
is  equivalent  to  U  being  one-to-one,  or  to  U*  being  onto  C^.  Every  x  in 

M 

can  be  written  uniquely  as  a  sum  xrUv  ♦  w  .  for  some  v  in  C  and  w  m 
CN  such  that  lTw=0;  in  fact,  y=(U*U)"  !U*x  and  wrx-Uv. 

The  riL  method,  because  it  relied  on  the  eigenvectors  of  R 
associated  with  the  lowest  eigenvalues,  failed  when  those  eigenvectors 
x^,  n=M* 1  ,...,N,  had  the  property  U*xf)z0.  To  obtain  a  data  adaptive, 
nonlinear  method  that  works  in  the  presence  of  modal  noise  we  need  to 
rely  on  vectors  x  with  the  property  that  Q.(cq.Zq),’xzQ.  but  such  that 

Ct(r.z)+x^0  for  other  values  of  (r.z);  in  particular,  we  must  not  have 
U+x-0,  or  even  near  zero.  One  way  to  prevent  this  behavior  is  to  require 
that  x  not  have  the  component  w  such  that  U*w=0;  that  is,  require  that  x 
have  the  form  x=Uv  for  some  y.  let  us  now  solve  the  following 
optimization  problem,  in  lieu  of  the  one  that  leads  to  the  lowest 
eigenvalue  of  R; 


Minimize  x*Rx  ,  subject  to  xx*=l  3nd  x-Uy  for  some  y 

The  vector  x  that  solves  this  problem  C3n  be  obtained  in  a  particularly 
simple  way:  define  the  M  by  M  matrix  T=(lTll)" 1  ' ,"‘L\  then 

x=U(U*U)~ ! g,  where  e  is  the  normalized  eigenvector  of  T  associated  with 
the  smallest  eigenvalue.  We  can  now  determine  (rQ.Zg)  by  searching  for 

the  zero  of  the  function  £(r,z)*x  .  or,  having  obtained  T,  we  can  calculate 
the  reduced  maximum  likelihood  (RML)  estimator: 

RhL(r,z)=  1/ g(r.z)*T' !a(r,z)  .  <}(r,z)  =  (U'U)’ 1  ^l/pfr.z)  .  (10) 

which  is  equivalent  to  a  ML  procedure  on  the  mode-filtered  R 

RML ( r.z)  =  1/ s(r.z)4|s(rg.ZQ)s(rQ.z,-))4,Oi’  's(r.z)  ( :  - 


Reduced  maximum  likelihood  method  for  the  general  case 


What  causes  difficulty  for  the  P1L  estimator  in  the  normal  mode 
case  is  that  the  modal  noise  resembles  actual  sources  and  that  its 
matrix,  UQU*.  has  rank  n  but  dimension  N.  The  ML  method  and  other  high 
resolution  methods  expect  the  lowest  eigenvalues  to  correspond  to 
eigenvectors  orthogonal  to  the  signal  component,  but  not  to  all  potential 
signals  as  well.  When  this  happens  the  ML  is  useless  and  the  reduced  ML. 
or  something  like  it,  required  .  y 

Let  us  suppose  that  Jhe  matrix  R.-has  rank  M,  but  dimension  N>M,  so 
that  it  has  the  form  R=UTU\  wherejyis  an  M  by  11  positive  definite 
Hermitian  matrix  and  U  is  some  N  by  M  matrix;  in  the  normal  mode  case 
it  consists  of  eigenfunction  samples,  but  in  general,  we  will  not  have  a 
model  for  U.  Assume,  for  now,  that  U  is  known  and  that  T  is  data 

Y 

dependent.  Writing  T=WW+,  where  W  is  an  d  by  11  matrix,  we  have 
R=UWW+U+=W*,  for  V=uw.  Because  R  has  rank  M  its  inverse  does  not 
exist,  so  ML  cannot  be  formed.  A  standard  trick  to  create  invertibility  is 
to  add  a  small  positive  quantity  to  the  main  diagonal  of  R:  in  the  norma! 
mode  case  above  this  was  done  through  the  el  term,  but  did  not  make  Ml 
usable.  Instead,  we  consider  replacing  R"'  by  R*.  the  pseudo- inverse  of 
R,  to  obtain  a  reduced  ML  estimator. 

The  pseudo- inverse  of  R  is  R*=V(V*V)~2\/*=U(U+U)~  ’t"  '(ifur'u* 
and  the  reduced  ML  estimator  is  _  f-w 


- -  *-  - -  /iss'tu') 

/ 

rml(9)=  i/£(e)*R*u(e)  =  i/s(e)*T"1s(e)  .  (12) 

where  s(6)r  (U+U)" 1  U*£.f ©);  this  agrees  with  the  RML  we  obtained  above 
in  the  normal  mode  case. 

In  general  we  may  not  have  a  model  for  U  and  may  need  to  form  R* 
some  other  way.  We  can  do  that  by  taking  R=XLX*.  R*=XL*X*.  where  L* 
is  defined  to  be  the  diagonal  matrix  with  entries  An~'.  provided  Xn  is 

not  too  close  to  zero,  and  0  otherwise.  Using  R*  in  (12)  gives  the  RML 
estimator  in  the  general  case.  '<  jz-  u  Y  U  * 
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Using  the  matrix  structure  to  obtain  environment  parameters 


jn  [  i  Buckingham  and  Jones  discuss  the  use  of  the  angular 
distribution  of  noise  power  received  by  a  vertical  array  to  estimate  the 
critical  grazing  angle  of  the  bottom,  and  hence  the  compressions!  sound 
speed  in  the  bottom  sediment.  This  is  an  example  of  3n  inverse  problem, 
in  which  environmental  parameters  are  estimated  by  comparing  measured 
data  with  theory. 


In  another  paper  [  1  Buckingham  considers  the  structure  of  the  modal 
noise  component  UQU\  for  the  case  of  a  range- independent  isovelocity 
channel  with  a  low-loss  bottom,  where  the  noise  energy  is  uniformly 
distributed  immediately  beneath  the  surface.  The  interesting  point  made 
in  [  ],  from  our  perspective,  is  that  the  matrix  Q  is  essentially  a 
multiple  of  the  identity  matrix.  Knowing  that  the  theory  requires  this 
and  having  measured  R=UQU4  in  the  noise-only  case,  we  could  estimate 
the  matrix  U;  for  each  choice  U  form  Q=(U4U)" 'u+RU(U+U)” if  we  have 
chosen  the  correct  U=U  then  Q=Q=c*I.  whereas  for  wrong  choices  of  0  the 
matrix  Q  need  not  be  diagonal. 


An  interesting  object  of  study  would  be  those  noise  fields  for  which 
the  matrix  Q  is  diagonal,  is  it  generally  true  in  normal  mode 
environments  or  are  such  noises  special  cases7  It  is  not  a  statistical 
phenomenon  but.  in  the  case  considered  in  [  ]  at  least,  follows  from  the 
nature  of  the  eigenfunctions  Urn(z)  and  the  manner  in  which  the  sources 

of  independent  noise  energy  are  distributed  spatially.  The  analysis  in  [  ] 
Coes  invoke  a  "large  IT  approximation  (p.  1  188,  (10)).  so  leaves  open  the 
question  of  whether  or  not  Q  is  diagonal  for  small  H. 


On  entropy  criteria  for  solving  inverse  problems  with  positivity 

constraints 


by 

Charles  L.  Byrne  and  Lee  K.  Jones 
Department  of  Mathematics 
University  of  Lowell 
Lowell,  Massachusetts  01654 


ABSTRACT 


We  are  concerned  here  with  suitable  criteria  to  use  for  the 
reconstruction  ot  positive  functions  from  limited  data.  We  shall  assume 
that  the  integral  of  the  function  (call  it  R(x'i)  over  its  (compact)  support 
is  known,  so  that,  by  rescaling  to  have  integral  one,  we  may  assume  that 
the  function  to  be  recovered  is  a  (measurable)  probability  density 
function.  A  general  procedure  for  reconstruction  is  to  select  a  prior 


estimate  of  F; (;■■:)  (call  it  P(:-:))  and  then  to  accept  as  the  estimate  of  R(x) 
that  data  consistent  density  Q(x)  that  tj  closest  to  POO,  in  some 
appropriate  measure  of  distance.  The  distances  D(P,Q)=jf(Q(x),P(x))dx  , 
where  f(y.z)  is  a  suitably  constrained  function  of  y,z>0,  provide  a  wide 
class  of  reconstruction  procedures,  which  is  the  topic  of  this  paper.  A 
special  case  of  such  reconstruction  is  the  "minimization  of  cross 


entropy"  method  (MCE). 


The  method  of  minimization  of  cross  entropy  (MCE),  implicit  in  the 
work  of  Shannon  [  1 1,  and  advocated  by  numerous  authors,  including 
Jaynes  (2,31,  was  proposed  by  Kullback  (41,  who  called  it  the  “principle  of 
minimum  directed  divergence".  The  term  "cross  entropy"  is  due  to  I.J. 
Good  [5].  The  MCE  method  has  been  studied  extensively  by  Shore  and 
Johnson,  who  have  derived  the  principle  from  axioms  of  consistent 
inference  to]  and  have  used  the  resulting  reconstruction  method  in  speech 
processing  and  spectrum  analysis  (?]. 


Although  the  axiomatic  derivation  of  the  MCE  method  found  in  [61  is 
based  on  probability  theory,  it  is  finding  application  in  the  reconstruct¬ 
ion  of  essentially  non-probabilistic  functions  (which  happen  to  have  the 
mathematical  properties  of  probability  density  functions),  such  as  energy 
distribution  as  a  function  of  bearing  (array  processing),  x-ray  attenuat¬ 
ion  functions  (in  tomography),  and  non-negative  images  (in  optics).  The 
properties  that  the  reconstruction  will  exhibit  in  these  cases  are  not 
easily  predicted  from  the  axioms  of  logical  inference  from  which  the 
method  is  derived.  The  basic  problem  is  one  of  approximation:  we  wish 
to  use  the  data  and  prior  information  to  construct  a  function  that  will 
approximate  the  desired  function  in  some  appropriate  sense.  The  MCE 
method  employs  a  (non-syrnmetric)  measure  of  distance  between 
functions  that  obeys  an  orthogonality  principle  analogous  to  that 
associated  with  the  metric  of  Hilbert  space. 


In  [S]  it  was  shown  that  this  orthogonality  principle  serves  to 
distinguish  the  MCE  method  from  other  distance  minimization 
procedures  obtained  from  the  so-called  Ali-SiJvey-Csiszar  distances.  In 
this  paper  we  consider  the  extent  to  which  the  orthogonality  principle 
further  distinguishes  the  MCE  method  from  methods  based  on  a  much 
wider  class  of  distances. 

The  problem  is  to  reconstruct  the  non-negative  measurable  function 
defined  on  a  compact  set  X  within  d-dimensional  real  Euclidean 
space,  knowing  only  the  linear  functional  values  rk,  k=0,...,K,  given  by 


r[(  =  [  R(::)  gk(x)  dx  , 


where  the  gk(x)  are  known  linearly  independent  bounded  measurable 

functions.  We  assume  that  g0(x)=  1  for  all  x,  so  that  r0=  area  of  R(x), 

which  we  then  take  to  be  equal  to  one.  We  let  i?  he  the  collection  of  all 
measurable  probability  density  functions  supported  cm  X  for  which  (1) 
holds.  A  member  of  Q  is  called  sdmissWte  if  it  is  bounded  above  and 
away  from  zero  on  X.  We  assume  that  we  have  available  a  prior  estimate 
of  R(x)  in  the  form  of  a  measurable  probability  density  function  P(x), 
supported  in  X.  To  obtain  our  reconstruction  (estimate)  of  R(x)  we  select 
that  Q(x)  in  Q  closest  to  P(x)  in  some  appropriate  sense. 

To  measure  closeness  we  employ  non-negative  distances  of  the 

form 


D(Q,P)  =  J  f(Q(x),P(x))  dx  ,  (2) 

where  f(y,z)  is  a  suitably  smooth  kernel  function.  Associated  with  a 
particular  distance  is  the  following  optimization  problem: 

Problem  A:  Find  Q  in  Q  such  that  D(Q,P)<  D(Q,P),  for  all  Q  in  Q. 

For  example,  the  MCE  method  employs  the  kernel  f(y,z)=  y  logOy/z), 
and  D(Q,P)  becomes  the  cross  entropy  of  Q,  given  P: 


E(Q,P)  =  \  Q(x)  log[Q(x)/P(x)J  dx  ; 

a, 

the  MCE  solution  is  that  Qr„|££(x)  in  Q  for  whit 


for  which  (3)  is  minimized. 


r.  r.  -  .  T,  r,  r, r,*.  r,  - ,- c-  ^  ■.  H.-  W  '-'  V  ». 


2.  The  orthogonal itu  pri net d > e  for  the  MCE  method 

a 

It  may  be  that  there  is  no  member  of  Q  for  which  E(Q,P;  is  finite; 
suppose  the  support  of  P  is  not  all  of  X  and  g j (x)  is  1  for  those  k  in  X, 

but  outside  the  support  of  P,  while  0  otherwise,  and  that  rj  =  1.  However, 

if  there  is  one  Q  in  i?  for  which  E(Q,P)  is  finite  the  the  MCE  solution 
evicts  and  has  the  form 

Qj'iCE(;0=  Ptx)  exp(  a^  g^tx)  +...+  aj.  g^fx)  )  ,  (4) 

for  x  in  N,  arid  zero  otherwise,  where  N  is  the  union  of  the  supports  of 
all  Q  in  0  for  which  E(Q,P)  is  finite.  A  rigorous  proof  of  this  was  first 
given  by  Csiszar  [9]  and  a  more  restrictive  sufficiency  version  was 
proved  earlier  by  Kul  1  back  [10]. 


Assume  now  that  the  support  of  P  equals  X,  so  that  the  MCE  solution 
exists  and  has  support  X.  Let  7  be  the  set  of  all  admissible  densities 
having  the  form 


T(x)  -  P(x)  exp(  t0  g^(x)  +  ...  +  tK  gK(x)  ) ,  x  in  X.  (5) 

Since  |  T(x)  dx  =1  there  are  only  K  free  parameters  arid  tn  =  tn(t  ^  ,...,tjr) 
is  a  function  of  the  other  parameters;  specifically, 


t0  =  -log  { f  PM  exp(  t,  gp. 
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The  following  theorem  is  the  orthogonality  principle  of  interest 
Theorem:  The  choice  T(x)=QM^^(x)  minimizes  E(R,T)  over  all  T  in 


here: 


7. 


proof:  It  is  easily  shown  that  E(RfT)=E(Qj*iCE,T)  +  E(R.P)  -  E(Qmq^jP)?  so 
that  E(R,T)  and  ECQ^p.T)  are  simultaneously  minimized.  But  it  follows 

from  elementary  properties  of  the  function  f (y.z)=y  log(y/z)  that 
E(QMrT,Q^,-r)c  E(Q^Ce»T),  with  equality  if  and  only  if  T=QMCc. 
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In  tills  paper  we  are  concerned  with  characterizing  those  distances 
!aQ,P)=  |  * (Q(.  x).P(x))  dx  for' which  the  analogous  orthogonality  principle 
holds. 

3.  The  generalized  orthogonality  principle 

If  P  is  an  admissible  density  and  the  Q(x)  in  Q  minimizing  D(Q,P)  v. 
also  admissible  then  it  can  be  shown  that  the  Euler-Lagrange  equation 
must  be  satisfied;  that  is,  fur  all  x  we  have 

f  1  (Q(x),P(x))  =  aQ  g0(x)  +  ...  +  aK  gK(x) ,  (7) 


for  some  choice  of  constants  aQ,...,ap  where  f  j(y,z)  is  the  partial 
derivative  of  f  with  respect  to  the  first  variable. 

We  now  generalize  the  orthogonality  principle  by  defining  the  class 
7  now  to  be  all  admissible  densities  T(x)  that,  for  some  constants 
t,'j,  t|/,  satisfy  the  equation 

f  j(T(x),P(x))  =  tQ  Qq(x)  + ...  +  tj/  g}/(x)  ,  for  all  x.  (8) 

We  then  consider  a  second  problem  associated  with  the  distance  D: 
Problem  B:  Minimize  D(R.T)  over  all  T  in  7. 


The  distance  measure  o(Q,P)  is  then  said  to  exhibit  the  criPcpcnc/ftii 
principle  if  whenever  Problem  A  has  unique  solution  Q(x)  then  it  is  also 
the  unique  solution  to  Problem  B. 

Example  1.  If  f(y,z)  =  y  log(y/z)  then  D(Q,P)=E(Q,P)  and  the  orthogonality 
principle  holds. 

Example  2.  If  f(y,z)  =  (y-z)^/2  then  Q(x)=  P(x)  +  a0  g0(xK..*aK  gK(x) 

follows  from  (8);  although  the  orthogonality  principle  holds  there  may  be 
no  positive  member  of  Q  of  this  form. 
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txampie  3.  If  uy.zj  =  y  e(z/y/,  where  evtj  is.  tnree  times  continuously 
differentiable  and  a^e/ dt~  >0,  then  the  distance  D(Q,P)  is  in  the  smooth 
Ali-Silvey-Csiszar  class;  it  was  shown  in  (SI  that  only  £(Q,P)  satisfies  . 
the  orthogonality  principle  in  this  class.  yy>jK;^jhi 

We  begin  now  to  extend  this  result  of  (a)  to  a  wider  class  of 
distances  associated  with  more  general  functions  f(y,z). 

A.  Regularity  assumptions  and  admissible  kernels 

We  consider  a  general  distance  measure  D(Q,P)=j  f(Q(x),P(x))  dx, 
and  impose  conditions  on  the  function  f(y,z)  in  order  that  the  distance  D 
have  reasonable  metric  properties. 


condition  v.  T ^ Cy*y>  =  c  =  constant,  for  y>0  ; 

we  want  D(Q,P)>B(Q,Q)  always  and  the  Euler-Lagrange  equation 
corresponding  to  minimizing  D(Q,P),  subject  only  to  Q>0  and  j  Q  =  1  ,  is 
f !  (P(x),P(x))=constant. 

Condition  2:  f(y,y)  =  0  for  y>0; 

we  want  D(P(x),P(x))=0  for  all  P(x). 

If  we  let  h(y)=f(y,y)=0  then  dh/dy  =0.  But  dh/dy=  fj(y,y) +  f2(y«y) $ 
it  follows  from  Conditions  1  and  2  that 

Condition  5:  fo(y,y)  =  -c  ,  y~-0  . 

Our  fourth  condition  is  that  f(y,z)  be  strictly  convex  in  the  first 
variable;  that  is, 

Condition  4:  f  j  j(y,z)>0  ,  for  all  y,z>0  ; 
we  arrive  at  this  condition  by  examining  what  happens  to  D(Q,P)  for  Q 


near  Q.  Let  x,-,  be  a  fixed  point  in  X  and  let  U  be  a  neighborhood  of 

Then  let  h(x.)  be  any  continuous  function,  supported  on  U  ,  and  not  in  the 
linear  span  of  the  functions  X.y  gk(x),  k=0,...,K,  the  restrictions  to  U  of 

the  functions  g^(x).  Let  j(x)  be  the  prolection  of  h(x)  onto  the  span  of 

the  Xy  9k(x)  and  let  rn(x)=  h(x)-j(x);  then  f  m(x)  g^Cx)  dx  =  0,  k=0,...,K. 

The  functions  Q(x)  +  srn(x)  are  data  consistent,  for  all  e>0.  We  want  the 
minimum  of  J  f(Q(x)+sm(x),P(x))  to  occur  at  s=0;  the  second  derivative, 
with  respect  to  s,  is  J  f  j  j(Q(x)+em(x),P(x))  m(x)~  dx.  This  must  be 

positive  at  s=0,  for  every  neighborhood  IJ,  no  matter  how  small,  and  for 
every  point  Xq;  from  our  freedom  to  select  the  problem,  hence  the  Q(x) 

and  F'(x),  it  follows  that  Condition  4  must  hold. 


Summarizing,  we  say  that  f(y,z)  is  an  stfmfssffi/e k&msZ  if  the 
following  conditions  hold:  f  | (y,y)=-f 0(y,y)=c  ;  f(y,y)=Q,  f ;  1  (y,2)>0,  for 

all  y,z>0. 


The  four  conditions  above  do  not  guarantee  that  the  Q(x)  satisfying 
the  Euler-Lagrange  equation  will  be  positive;  the.  function  f(y,z)  is  said 
to  be  p-stimissWie  if  the  equation  (7)  always  has  a  positive  solution  for 
Q(x). 


5.  Necessary  conditions  for  the  orthogonality  principle 

The  following  Proposition  (which  we  prove  in  Appendix  ,Y)  is  basic 
to  our  characterization  of  distances  D(Q,P)=j  f(Q(x),P(x))  dx  having  the 
orthogonality  principle: 

Proposition  i:  Suppose  that  f(y,z)  is  an  admissible  kernel  such  that,  for 
every  choice  of  R(x)  ,  the  gk(x)  and  the  prior  P(x),  the  orthogonality 

principle  holds  for  D(Q,P)=j'F(Q(x),P(x))dx.  Then  U, j  j(y,z)=0  for  all  y,z>0. 


I 
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From  the  equation  f 2 1  \ (y.z)=0  and  our  conditions,  it  follows  that 
i'(y,:)  has  the  form 


f(y,z)  =  cy  -yj(z)  +  J(y)  -cz  +  zj(z)  -  J(z) ,  (9) 

for  some  function  strictly  convex 'j(z),  with  dJ/dz=  J'(z)=j(z).  The 
vj**  convexity  of  J(z)  implies  that  the  function  ffy,z),  given  by 


f(y,2)  =  -yj(z)  +  Jfy)  +  zj(z)  -  J(z)  ,  (10) 

is  positive.  It  also  follows  that  if  f(y,z)  has  form  (9)  then  the  necessary 
condition  (7)  becomes 


f  j(Q(x),P(x))  =  c  +  j(Q(x))  -  j(p(x))  =  0. 


(11) 


In  order  that  f  be  p-adrnissible  it  is  necessary  and  sufficient  that  the 
strictly  increasing  function  jfy)  map  the  positive  reals  onto  the  entire 
real  line;  we  make  this  another  condition: 


Condition  5:  The  range  of  jfy),  for  y>0,  is  the  entire  real  line. 

Example  1:  Let  j(z)=  log(z),  so  that  J(z)=  zlog(z)  -  z  ;  then  condition  5 
is  satisfied.  For  c=1  we  have  f(y,z)=  ylogfy/z),  so  the  distance  is  cross 
entropy. 


Example  2:  Let  j(z)  =  -1/ 
satisfied.  For  c=0  we  oet 

v 

all  x,  then  the  resulting  D 


z  ,  so  that  J(z)=  -logfz);  then  condition  5  is  not 
ffy,z)=fy/z)  -  logfy/z).  If  p(x)=  constant,  for 
f Q,P)  is  equivalent  to  negative  Burg  entropy, 


-j'log(Q(x))dx.  The  Euler-  Lagrange  equation  in  this  C3se  leads  to  Q(x)= 
l/fa0+...+aK  gK(x)),  and  the  orthogonality  principle  says  that  the 


"maximum  entropy  method"  (MEM)  solution  is  the  closest  to  R,  among  all 
density  functions  of  its  form,  in  the  sense  of  minimizing  the  distance 
D(.R,T)=  JKR/T)  -  logfR/T)].  We  shall  return  to  the  MEM  later. 


J  -»  J  J  '  tThi^ii 
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6.  Characterizing  the  MCE  method 

We  hove  seen  that  (9)  is  necessary  (along  with  our  four  conditions) 
fur  D(Q,P)  to  have  the  orthogonality  principle  generally.  In  fact  the 
converse  is  also  true: 

Proposition  2:  If  f(y,z)  satisfies  (9)  for  all  y,z>0  then  D(Q,P)  has  the 
orthogonality  principle. 

proof:  Let  Q(x)  be  the  unique  solution  to  Problem  A.  The  Euler-Lagrange 
e q u a t i o n  i s  n o w  j ( Q ( k ) ) - i ( p ( x ))=  a a ^ g K- ( x )  s o  t h e  c o nditi o n  defi n i n g 

the  class  T  is  j(T(x))-j(P(x))=  t^+...+t|/g^(x)  for  some  constants 

Since  Q  is  data  consistent  it  follows  that  D(R,T)=D(Q.T)+D(R.F')-D(Q.P), 
from  'which  we  conclude  that  D(R,T)  and  D(Q,T)  are  simultaneously 
minimized.  But  B(Q,Q)<D(Q,T)  unless  T=Q. 


We  see  from  (9)  that  f(y,z)  contains  terms  that  involve  only  z;  these 
terms,  when  they  appear  in  D(Q,P),  will  involve  only  the  prior  P,  and  not 
the  unknown  Q.  It  would  seem  artificial  if  D(Q,P)  has  the  orthogonality 
principle,  but  the  distance  obtained  from  D(Q,P)  by  omitting  the  terms 
that  do  not  involve  Q  does  not.  In  fact,  we  can  characterize  the  MCE  in 
precisely  these  terms;  t'(y,z)=ylog(y/z)  is  essentially  the  only  function 
of  the  form  (9)  having  no  terms  involving  only  the  variable  z. 

Proposition  7:  If  f(y,z)  has  the  form  (9)  and  zj(z)  -J(z)  -cz  =  0  for  all  z 
then  D(Q,P)=jf(Q,P)  is  equivalent  to  E(Q,P). 
proof.  From  the  differential  equation  for  J  it  follows  easily  that 
•J(z)=a(zlog(z))  +  bz,  for  some  constants  a  and  b. 

Proposition  4:  If  D(Q,P)=B(P,Q)  has  the  orthogonality  principle  then 
D(Q,P)  is  equivalent  to  least  squares:  that  is,  to  the  distance  (Q-P)“: 
hence  there  is  no  symmetric  distance  satisfying  condition  5  that  obeys 
the  orthgonality  principle. 

proof:  The  symmetry  condition  and  (9)  imply  that  J(y)  satisfies  the 
following  differential  equation,  for  each  fixed  z: 


k(y)-(2/(y-z))J(y)  =  -(2/(y-z))J(z)-J'(z)  . 


(\r{ 


It  then  follows  that  JCy)  is  quadratic  in  y. 


In  l  ]  Burn  considers  the  problem  of  reconstructing  the  positive 
power  spectral  density  function  Rico),  |  00  |  in,  from  the  finitely  many 
values  of  its  Fourier  transform, 

Tt 

rn  =  {  R(w)  exp(-inw)  dv/2n  ,  |n|<N  .  ( !Z) 


The  MEM  he  proposes  adopts,  as  the  estimate  of  F;(co),  that  positive 
function  Q(ojO=  QmEI satisfying  the  constraints  (12)  for  which  the 

Burg  entropy,  }log(Q(cL'0)dcu,  is  maximized.  The  Euler-Lagrange  equation 
says  that  Q^c^(o>)  must  have  the  form 


^MEM^)  =  i/  I  bn  exp(in<x>)  ,  |o>|irt, 


vhere  the  bn  are  chosen  so  as  to  satisfy  (1  'pi. 


It  does  not  follow  from  (12)  that  there  will  be  a  positive  solution. 
However,  Burg  proceeds  by  assuming  that  there  is  a  positive  solution  of 
the  form  (l£),  and  uses  the  Fejer-Riesz  theorem  ([  ],  p.  231)  to  rewrite 
the  solution  in  the  form 

*  „  < 

QMEri(<D)=  3q/  |  2  an  expOnto)  |~,  | [ in,  (I/O 


where  the  polynomial  A(z)=  aQ+a  jZ+...+af,,jZf'J  has  all  its  roots  outside  the 
unit  circle.  He  then  shows  that  the  vector  a=(aQ,...,a^)"’'  must  satisfy  the 
matrix  equation  Ra=£  where  R  is  the  W+I  by  M+  (  matrix  with  entries 
Rm,n=rn-rn  >  m,n=  1  ,...,N+ 1,  and  S=(1,0,...,0)T.  So  far  everything  is  based 
on  the  assumption  that  a  positive  solution  exists. 


But  now  Burg  shows  that,  given  the  matrix  R  obtained  from  the  data, 
the  a=R“  must  be  such  that  A(z)  has  all  its  roots  outside  the  unit 
circle  (that  is,  a  has  the  minimum phsse property) >  from  which  it 
follows  that  the  in  ( \A)  is  data  consistent.  So  he  has  succeeded 


in  producing  a  positive,  data  consistent  function.  The  question  still 
remains:  Does  this  actually  maximize  the  Burg  entropy,  among 

the  class  of  data  consistent  power  spectral  densities?  . 

That  it  does  maximize  the  entropy  follows  from  a  consideration  of 
the  relationship  between  the  entropy  and  the  error  of  one-step  prediction 
of  a  time  series  from  knowledge  of  its  infinite  past  (see  Papoulis  [  ]. 
p.427). 

We  noted  earlier  that  f(y,z)=(y/z)-log(y/z)  gives  the  ME11,  but  the 
j(y)  fails  to  have  condition  5.  This  suggests  that,  in  general,  the  MEM 
formalism  may  not  lead  to  a  positive  solution:  this  is  the  case,  for 
example,  when  R  is  defined  on  higher  dimensional  space,  or  when  the 
functions  gk  are  not  simply  one-dimensional  exponential s. 

It  also  causes  difficulty  if,  in  the  problem  considered  by  Burg,  the 
support  of  R(o>)  is  [-ft, ft],  instead  of  l-n,n],  where  0<ft<7t.  The 
polynomial  in  02)  need  not  factor  as  before;  all  that  the  Euler-Lagrange 
equation  requires  is  that  the  solution  have  the  form  (]%  over  [-ft, ft],  so 
the  polynomial  can  be  negative  for  some  values  of  w.  There  is  no 
guarantee  that  the  function  of  the  form  02)  that  is  data  consistent  will 
be  positive  within  [-ft, ft]. 

Burg's  maximum  entropy  method  (MEM)  for  reconstructing  a  positive 
function  from  Fourier  transform  values  is  justified  as  a  method  for 
power  spectrum  estimation  by  appealing  to  the  limiting  expression  for 
the  multivariate  entropy  of  time-domain  samples  of  a  Gaussian  random 
process.  As  we  have  just  seen,  the  maximization  of  flog(Q(x))dx,  subject 
to  the  d3ta  constraints,  corresponds  to  the  use  of  j(z)=-l/z  in  (9),  hence 
has  the  orthogonality  principle.  The  Euler-Lagrange  equation  shows  that 
the  MEM  solution  has  the  form  Q(x)=  l/{aQ+...+aK  gK(x)}  ,  with  the  ak 

chosen  so  as  to  make  Q  data  consistent.  From  Proposition  2  we  know 
that,  whenever  Q(x)  exists,  it  provides  a  minimum,  among  all  densities 
of  its  form,  for  the  distance  J[(F!/T)  -log(F!/T)]  dx  .Thus  Q(x)  is  close  to 
R(x)  in  an  appropriate  sense  arid  the  MEM  is  justified  without  appeal  to 
G  a  u  s s i a n  p r o  cesses. 


»'  V  •'  /  V 
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minimization  of  3  distance  D( Q.P  .)= t f (Q(x),P(x))dx,  subject  to  Cj  satisfying 
data  constraints,  where  P(x)  is  a  prior  estimate  of  the  positive  function 
to  be  recovered.  We  have  limited  the  discussion  to  those  f(y,z)  satisfying 
certain  conditions  chosen  to  endow  D(Q,P)  with  properties  suitable  for 
approximation.  The  MCE  method,  based  on  the  choice  of  f(y.z;=yiog(y/z;. 
obeys  3n  orthogonality  principle  analogous  to  that  associated  with 
orthogonal  linear  projection  in  Hilbert  space,  and  our  objective  here  was 
to  investigate  the  extent  to  which  this  orthogonality  principle  serves  to 
characterize  MCE  among  distances  of  the  above  form.  Such  a 
characterization  would  then  provide  a  purely  approximation  theoretic 
justification  for  the  MCE  procedure. 

We  have  shown  that,  in  order  for  the  distance  D(Q,P)=jf(Q(x),P(;;))dx 
to  obey  the  orthogonality  principle  (as  well  as  the  other  conditions)  it  is 
necessary  that  f(y,z)  have  the  form 

t 

f(y,z)=  cy  -yj(z)  +  J(y)  -cz  +zj(z)  -  J(z),  y,z>0,  (If) 

where  J(y)  is  a  strictly  convex  function  with  derivative  j(y),  and  c  is 
some  constant.  Terms  involving  z  only  play  no  role  when  D(Q,P)  is 
minimized  as  a  function  of  Q;  if  those  terms  vanish  identically  then  the 
distance  D  is  equivalent  to  cross  entropy. 


Appendix  i :  proof  of  Proposition  1 

If  we  formally  differentiate  the  equation  (7)  defining  the  class 
with  respect  to  the  variable  t(,.  for  some  k=l....,K,  we  obtain 


(9T/3tk)f  1 1(T(x),P(x))=  Ot0/dtk  +  gk(x)  . 


Fix  x  and  consider  (l£)  as  a  partial  differential  equation  for  tfje  functi 
T=T(x:tl'),...,t|>-),  viewed  as  a  function  of  tQ,...,t^.  We  rewrite  Cfj)  as 

3T  /  3 1  k  =  ( 3  1 0  /  3 1  k  +  g  k  ]  h  (T ) ,  07) 

where  h(y)  =  h(y;x)=  1  / f ,  j(y,P(x)).  Let  G(y)=G(y;x)  be  the  (increasing) 

anti-derivative  of  the  function  h(y)  having  the  "constant  of  integration' 
such  that 

K  ci 

G(Q(x))=  2  gk(x) .  020 


'hen  the  function 


T(v;t0,...,tK)  =  G~  (  o  tk  gk(x)  ) 


solves  the  partial  differential  equation  Oip),  is  equal  to  Q(x)  when  the 
tk=ak  and  is  bounded  above  and  away  from  0,  uniformly  in  x,  for 

t=(t i ,...,t(/)  in  a  neighborhood  of 

For  t  in  some  neiahborhood  of  a  the  function  T(x;tA . t(/)  given  by 

ij  ^  K 

(w*?)  will  be  a  probability  density  function,  hence  the  problem  of 
minimizing  D(R,T)  over!  in  T  can  be  viewed  as  that  of  minimizing 
D(R,T(t))  subject  to  the  constraint  f  T(t)  =  JT(x;t)dx  =1.  Forming  the 
Lagrangian  we  obtain 

■z  I 

0=  3/3 tk  {  D(R,T(t))  -  X  j  T(t)  1  (56) 

or,  differentiating  under  the  integral, 

0=  [{  [^(ROO.CKx))  -X.][3t0/3tk  +  gk(x)l  /f  j  |(Q(x),P(x))J  dx  .  (5-1) 


From  !  i  it)  -)  it  i 01  lows  that  (for  t  near  a.  so  that  i  is  in  T) 


'i-  1  f  i  i  t'i  \  -  l  [Ht  .  //■it.  *  n.  1/f  .  .  i'T  P'i 


H 


so  that 

0=  \  {  [f.-,(R(x),Q(x))][(3t0/3tk)  |  t=d  +  gk(x)]/f  j  |(Q(x),P(x))}  dx.  (23) 
u 

Equation  (2?)  holds  it  we  replace  R(x)  with  any  other  pdf  that  is  data 
consistent.  If  Xq  is  any  point  in  X  and  U  is  a  neighborhood  of  Xq  we  can 

find  a  bounded  measurable  function  t(x),  supported  on  U,  and  orthogonal 
to  each  of  the  q.„  so  that  S(x)=  R(x)+sr(x)  is  a  data  consistent  pdf,  for 

s>0  small  enough:  then  (23)  holds  with  5(x)  instead  of  R(x). 


Replacing  R(x)  with  S(x)  and  then  differentiating  twice  with  respect 
to  s  gives 

0=  |{f21 1  (SCx),Q(x))(xr(x))2[Ot0/atk)  |  t=a  +  gk(x)]/f  n(Q(x),F'(x))}  .  (24) 


The  integral  is  over  the  neighborhood  U,  since  r(x)  is  zero  off  IJ. 


Mow  let  z  -  Q(xq)  be  some  value  in  the  range  of  Q,  and  y>0.  By 
adding  to  R(x)  a  narrow  Gaussian  density  centered  at  x=Xq  and  then 
correcting  by  a  linear  combination  of  the  gk  we  C3n  assume  that  R(xq)=u. 
If,  in  the  neighborhood  U  of  >:0,  the  function  (Ot^/ot^)  |  +  gk(x)]  does 

not  change  sign,  then,  by  shrinking  the  neighborhood  about  x=Xq  and 
letting  s  go  to  zero  in  S(x),  we  can  conclude  that  f 2 ^  ](y,z)=0. 


Since  we  are  interested  in  distances  D(Q,P)  that  will  be  applied  to  a 
variety  of  problems  we  are  free  to  tailor  the  problem  to  suit  the 
technical  assumptions  needed  above.  In  particular,  we  can  assume  that 
the  x  are  one-dimensional,  that  the  gk  are  continuous  in  a  neighborhood 


of  Xq  and  assume  a  given  value  only  finitely  many  times.  So,  except 
perhaps  for  finitely  many  values  of  Xq,  the  above  argument  holds  and 
f.^i  ] (y,z)=0  for  z=Q(xq)  and  any  y>0.  But  this  must  be  true  for  all 


must  be  identically  zero, 


problems,  so  the  conclusion  then  is  that  , , 
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